A Decision Tree-based Text Art Extraction Method without any Language-Dependent Text Attribute

Author

  • Tetsuya SUZUKI
Abstract

Text-based pictures, called text art or ASCII art, are often used in Web pages, email text, and so on. They enrich expression in text data, but they can be noise for text processing and text display. For example, they can be an obstacle for text-to-speech software and natural language processing, and some of them lose their shape on small display devices. Text art extraction methods, which detect the areas of text art in given text data, let us ignore text art in text data or replace it with other strings. Because text data may include one or more natural languages, it is desirable that text art extraction methods be language-independent. In this paper, we propose a decision tree-based text art extraction method without any language-dependent text attribute. Our method uses attributes of the given text data that represent how much the text data looks like text art, while previously proposed methods use attributes that represent how much the text data looks like text in a specific language. We tested 63 combinations of 7 text attributes, including language-dependent and language-independent attributes, for text art recognition. The results show that a combination of language-independent attributes is the best for text art recognition. The attributes are one attribute based on the data compression ratio by Run Length Encoding and two text attributes based on text size. We also evaluated the performance of our text art extraction method with the language-independent attributes in an experiment.
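As a rough illustration of the Run Length Encoding attribute mentioned in the abstract, the sketch below computes a compression ratio for a piece of text. The abstract does not give the exact definition, so the ratio used here (number of runs divided by number of characters) and the function names `rle_encode` and `rle_compression_ratio` are assumptions made for this example, not the author's formulation. The intuition is that text art tends to contain long runs of repeated characters, so it compresses better under RLE than ordinary prose.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Run-length encode text as a list of (character, run length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


def rle_compression_ratio(text: str) -> float:
    """Return runs / characters (an assumed definition of the ratio).

    Values near 1.0 mean few repeated characters (typical of prose);
    values near 0.0 mean long runs (often seen in text art).
    Empty text is treated as incompressible and returns 1.0.
    """
    if not text:
        return 1.0
    return len(rle_encode(text)) / len(text)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to contrast the two kinds of text.
    art_line = "+----------+    /\\_/\\    +----------+"
    prose_line = "Text art is often used in Web pages."
    print(rle_compression_ratio(art_line))
    print(rle_compression_ratio(prose_line))
```

In a full pipeline, a value like this would be one input feature to a decision tree classifier, alongside the two size-based attributes the abstract mentions; those are not sketched here because the abstract does not define them.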


Similar resources

A new text-mining method for extracting user context information to improve the ranking of search engine results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual documents increases day by day, so we need useful ways to store them and retrieve information from them. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...


Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. It is considered a serious issue in higher academic institutions. There exist language-free tools, but they do not yield reliable results because they ignore the special features of each language. Considering the paucit...


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Non-Dictionary-Based Thai Word Segmentation Using Decision Trees

For languages without word boundary delimiters, dictionaries are needed for segmenting running texts. This makes segmentation accuracy depend significantly on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, it will lead to a great number of unknown or unrecognized words. These unrecognized words certainly reduce segmentation accuracy. To solve...


Normalisation and Analysis of Social Media Texts

We present a language-independent method for automatic diacritic restoration. The method focuses on low computational resource usage, making it suitable for mobile devices. We train a decision tree classifier on character-based features without involving a dictionary. Since our features require at most a few characters of context, this approach can be applied to very short text segments such as...




Publication date: 2006